Machine-learning based prediction and analysis of prognostic risk factors in patients with candidemia and bacteraemia: a 5-year analysis

Bacteraemia has attracted great attention owing to its serious outcomes, including deterioration of the primary disease, infection, severe sepsis, overwhelming septic shock or even death. Candidemia, secondary to bacteraemia, is frequently seen in hospitalised patients, especially in those with weak immune systems, and may lead to lethal outcomes and a poor prognosis. Moreover, candidemia is associated with higher morbidity and mortality. Owing to the complexity of patient conditions, the occurrence of candidemia is increasing, and candidemia-related studies remain relatively challenging. Because candidemia is associated with increasing mortality related to invasive infection of organs, its pathogenesis warrants further investigation. We collected the relevant clinical data of 367 patients with concomitant candidemia and bacteraemia in the First Hospital of China Medical University from January 2013 to January 2018. We analysed the available information and attempted to obtain the undisclosed information. Subsequently, we used machine learning to screen for regulators such as prognostic factors related to death. Of the 367 patients, 231 (62.9%) were men, and the median age of all patients was 61 years (range, 52–71 years), with 133 (36.2%) patients aged >65 years. In addition, 249 patients had hypoproteinaemia, and 169 patients were admitted to the intensive care unit (ICU) during hospitalisation. The most common fungi and bacteria associated with tumour development and Candida infection were Candida parapsilosis and Acinetobacter baumannii, respectively. Machine learning was applied to integrated clinical information to screen for death-related prognostic factors in patients with candidemia and bacteraemia. 
Serum creatinine level, endotoxic shock, length of stay in ICU, age, leukocyte count, total parenteral nutrition, total bilirubin level, length of stay in the hospital, PCT level and lymphocyte count were identified as the main prognostic factors. These findings may help clinicians treat patients with candidemia and bacteraemia.


INTRODUCTION
Candidemia and bacteraemia frequently occur in hospitalised or critical care patients with unfavourable prognosis and high mortality (Raoult & Richet, 2011;Bloos et al., 2013;Heimann et al., 2015;Kullberg & Arendrup, 2015;Logan, Martin-Loeches & Bicanic, 2020). Candidemia is diagnosed in >250,000 people annually worldwide and causes >50,000 deaths (Arendrup, 2010). According to a population-based study from the United States of America (USA) and a database-based systematic analysis from Europe, bacteraemia is the fourth leading cause of mortality (following cardiac diseases, combined lung and larynx cancers and cerebrovascular diseases) (Jensen et al., 2011;Sogaard et al., 2011). The mortality rate of candidemia was 29% according to a population-based study from the USA, 31% according to a study from Spain, 54% according to a multi-centre study from Brazil and 60% according to a survey conducted in South Africa (Colombo et al., 2006;Cleveland et al., 2012;Kreusch & Karstaedt, 2013;Puig-Asensio et al., 2014). In addition, on the basis of a study from the University of Pennsylvania, catheter-associated bacteraemia was found to be the 12th leading cause of death in the USA (Umscheid et al., 2011). Furthermore, the distribution and prevalence of Candida species vary by region and patient population (Pappas et al., 2018). Among fungal infections, Candida albicans infection is the most prevalent infection leading to death; however, the incidence of non-albicans candidemia has increased over the past decades worldwide (Pfaller et al., 2011;Castanheira et al., 2016;Lamoth et al., 2018). Moreover, the number of cases of Candida albicans infection dramatically decreased during the past decade in the USA and is less than half of that reported previously (Lockhart et al., 2012;Matsumoto et al., 2014;Pfaller, Jones & Castanheira, 2014). 
Because candidemia is frequently reported and Candida is the third most common causative agent of infection in intensive care units (ICUs) worldwide (17%), medical support and intensive treatment are challenging (Vincent et al., 2009;Colombo et al., 2014;Lortholary et al., 2014;Chakrabarti et al., 2015). Bloodstream infections often lead to severe diseases with high morbidity and mortality and can be acute or chronic (Opota et al., 2015). Candidemia and bacteraemia are associated with a heavy socioeconomic burden and widespread prevalence (Martinez & Wolk, 2016). Studies reporting on the epidemiological features and risk factors of candidemia and bacteraemia are limited; therefore, further investigation and integration are required.
Machine learning is a rapidly developing technique that is widely used for analysing medical information and supporting clinical decisions. Machine learning algorithms can derive rules from extensive data and reveal previously unknown relationships between factors (Jordan & Mitchell, 2015;Peiffer-Smadja et al., 2020). Random forest (RF), logistic regression (LR) and support vector machine (SVM) are widely used classification tools in bioinformatics and medical fields. RF is a supervised tree-based ensemble machine learning methodology, whereas SVM is a nonparametric, supervised and kernel-based statistical learning approach (Saberioon et al., 2018). Logistic regression estimates the relationship between one or more independent factors and a binary dependent variable (Schober & Vetter, 2021). Epidemiological characteristics and risk factors can be processed, analysed and predicted by using machine learning. Therefore, we used machine learning methods in this retrospective study to analyse the clinical information and identify risk factors for patients with candidemia and bacteraemia.

Human ethics
This study was conducted in accordance with the Declaration of Helsinki. It was approved by the Human Ethics Review Committee of the First Hospital of China Medical University (number: 2021-260). The ethics review board of the First Hospital of China Medical University waived the requirement for informed consent because this was a retrospective study. During data collection and preparation of the manuscript, all patient information was kept confidential.

Patient selection
Patients were selected and data were collected as described in our previous study (Li et al., 2020). Specifically, all data on Candida recovered from the blood of patients with invasive candidal infection were acquired (2008 version of EORTC/MSG criteria). The date on which the first positive blood culture result was obtained was taken as the onset of candidemia and bacteraemia. The data set captured relevant information from the selected patients, including patient clinical features, risk factors for candidemia and bacteraemia, treatment and survival status at discharge, haematological diagnoses, Candida and bacterial test results and antifungal therapy. Each hospitalisation of a patient was regarded as an event; a re-hospitalisation with new treatments was counted as a new event.

Definition
Persistent candidemia was defined as a condition in which the blood culture tests yielded positive results with the same Candida species after 7 days of initiating appropriate therapy.
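
The definition above can be expressed as a simple rule over culture records. The following is a minimal sketch, not the authors' procedure: the function name `is_persistent`, the record layout (date, species) and the inclusive reading of "after 7 days" (a positive culture on day 7 or later counts) are all assumptions made for illustration.

```python
from datetime import date, timedelta


def is_persistent(therapy_start, index_species, positive_cultures, window_days=7):
    """Return True if any positive culture grows the index species
    on or after `window_days` days from the start of appropriate therapy.

    positive_cultures: iterable of (culture_date, species) tuples (hypothetical layout).
    The inclusive cutoff (>=) is an assumption; the source only says "after 7 days".
    """
    cutoff = therapy_start + timedelta(days=window_days)
    return any(
        culture_date >= cutoff and species == index_species
        for culture_date, species in positive_cultures
    )


if __name__ == "__main__":
    # Hypothetical example records, not study data.
    start = date(2015, 3, 1)
    cultures = [
        (date(2015, 3, 5), "Candida parapsilosis"),
        (date(2015, 3, 9), "Candida parapsilosis"),
    ]
    print(is_persistent(start, "Candida parapsilosis", cultures))
```

A later positive culture that grows a different Candida species does not satisfy this rule, which mirrors the "same Candida species" condition in the definition.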

Microbiological test
The collected blood samples were cultured for 5 days, after which positive samples were transferred to blood agar plates. Fungal and bacterial isolates were subsequently cultured at 35 °C for 48–72 h, and Gram staining and microscopic examination were carried out simultaneously. Fungal and bacterial isolates were identified using a VITEK 2 Compact system (bioMérieux SA, Marcy-l'Étoile, France). Drug susceptibility tests were performed using the ATB FUNGUS 3 (bioMérieux SA, Marcy-l'Étoile, France), following the reagent instructions and the "national clinical test operating procedures". Yeast-like fungal isolates were inoculated into the ATB FUNGUS 3 drug susceptibility test box, and the minimum inhibitory concentration (MIC) value was determined according to the CLSI M27-A3 and M27-S4 antifungal susceptibility test standards. Candida ATCC6258 and Candida albicans ATCC90028 were used as the quality control strains.

Machine learning
We pre-processed the data by removing cases with missing data in 50% of the features and filling the remaining missing values with the mean value of the corresponding feature. Random forest, logistic regression and support vector machine were used to develop the final prediction model, and the data were analysed using five-fold cross-validation. The predictive power of each model was measured using the average area under the receiver operating characteristic curve (AUC) across the five folds. We applied the random forest model to identify important features.
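
The pre-processing and evaluation steps described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' code: the software used for modelling is not reported, the 50% rule is read here as "drop a case when at least half of its features are missing", folds are stratified so that every fold contains both outcomes, and the nearest-centroid scorer `fit_centroid_scorer` is only a small stand-in for the random forest, logistic regression and support vector machine models. All function names and the synthetic data are hypothetical.

```python
import random
from statistics import mean


def drop_sparse_rows(rows, labels, max_missing=0.5):
    """Drop cases whose fraction of missing (None) features is >= max_missing."""
    kept = [(r, y) for r, y in zip(rows, labels)
            if sum(v is None for v in r) / len(r) < max_missing]
    return [r for r, _ in kept], [y for _, y in kept]


def impute_mean(rows):
    """Fill each remaining None with the mean of the observed values in that column."""
    col_means = [mean(v for v in col if v is not None) for col in zip(*rows)]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=5, seed=0):
    """Deal shuffled indices of each class round-robin into k folds (stratification is an assumption)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def fit_centroid_scorer(X, y):
    """Stand-in model: score a case by projecting it onto (class-1 mean minus class-0 mean)."""
    mu1 = [mean(col) for col in zip(*(x for x, t in zip(X, y) if t == 1))]
    mu0 = [mean(col) for col in zip(*(x for x, t in zip(X, y) if t == 0))]
    w = [a - b for a, b in zip(mu1, mu0)]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


def cross_val_auc(X, y, fit, k=5, seed=0):
    """Average held-out AUC over k folds; `fit` returns a scoring function."""
    fold_aucs = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(y)) if i not in held_out]
        score = fit([X[i] for i in train], [y[i] for i in train])
        fold_aucs.append(auc([y[i] for i in test_idx], [score(X[i]) for i in test_idx]))
    return mean(fold_aucs)


if __name__ == "__main__":
    # Synthetic data for illustration only: three noisy features, about 10% missing.
    rng = random.Random(42)
    y = [i % 2 for i in range(100)]
    rows = [[rng.gauss(t, 1.0) if rng.random() > 0.1 else None for _ in range(3)] for t in y]
    rows, y = drop_sparse_rows(rows, y)
    X = impute_mean(rows)
    print("mean five-fold AUC:", round(cross_val_auc(X, y, fit_centroid_scorer), 3))
```

The sketch imputes before splitting, following the order given in the text. Computing the imputation means inside each training fold is a common alternative that avoids information from held-out cases entering the model.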

Statistical analysis
Statistical analysis was performed using SPSS 20.0. Non-normally distributed quantitative data were expressed as medians and quartiles [M (P25, P75)], and intergroup comparisons were performed using the Mann-Whitney test. Qualitative data were described as numbers and percentages, and intergroup comparisons were made using the chi-square test.
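
For readers without SPSS, the three summaries named above can be approximated with the Python standard library. This is a hedged sketch, not a reproduction of the SPSS output: the quartile method (`inclusive`), the normal approximation for the Mann-Whitney test without tie or continuity correction, and the uncorrected chi-square statistic for a 2 × 2 table are assumptions, and the function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist, quantiles


def median_iqr(values):
    """Return (P25, median, P75). The 'inclusive' quantile method is an assumption."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney test via the normal approximation.

    Returns (U for sample x, z, p). Ties receive average ranks, but no tie or
    continuity correction is applied to the variance.
    """
    pooled = sorted(list(x) + list(y))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    u1 = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    z = (u1 - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, z, p


def chi_square_2x2(table):
    """Pearson chi-square (no Yates correction) and p-value for [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 degree of freedom
    return chi2, p


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not study data.
    print(median_iqr([12, 30, 41, 55, 73]))
    print(mann_whitney_u([30, 28, 45, 33], [55, 60, 41, 70]))
    print(chi_square_2x2([[10, 20], [20, 10]]))
```

When any expected cell count is small, Fisher's exact test is the usual substitute for the chi-square test; it is not implemented in this sketch.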

In vitro antifungal susceptibility test
Of the 367 patients, 353 patients with information on drug sensitivity were selected. Approximately 20 patients were resistant to at least one drug. All 353 patients were sensitive to amphotericin B, and 352 were sensitive to flucytosine, with only one patient being resistant to it, suggesting that amphotericin B and flucytosine may be potential drugs for treating Candida infection. In addition, 339 and 335 patients were sensitive to voriconazole and itraconazole, respectively. Moreover, 17 of 20 (85%) patients with Candida glabrata infection showed strong dose-dependent sensitivity to fluconazole, which was not observed in any other group, suggesting that fluconazole is a candidate drug for the treatment of Candida glabrata infection. However, 6 of 25 (24%) patients with Candida tropicalis infection were resistant to fluconazole and voriconazole. Detailed information is provided in Fig. 3 and Table S3.

Risk factors for Candida albicans and non-Candida albicans infections
Clinical information and statistical data are presented in Table 1.

Analysis of risk factors in patients with persistent and non-persistent Candida infections
The clinical information and statistical data of patients with persistent and non-persistent Candida infection are provided in Table 2. Persistent Candida infection was associated with prolonged hospital and ICU stays, total parenteral nutrition, recent surgery (within the past 2 weeks) and central venous catheter insertion. In addition, leukocyte, neutrophil and lymphocyte counts were significantly higher in patients with persistent Candida infection.

Analysis of risk factors in patients with single and multiple fungal infections
Of the 367 patients with candidemia and bacteraemia, 59 (16.1%) patients had multiple fungal infections, whereas 308 (83.9%) patients had a single fungal infection. Clinical information and statistical data are presented in Table 3. Patients with multiple fungal infections were older (median age, 65 vs 61 years) and had longer hospital stays (median, 55 vs 30 days) and ICU stays (median, 19 vs 0 days) than patients with a single fungal infection. [...] multiple fungal infections had solid tumours, whereas 59.42% of patients with a single fungal infection and 25.42% of patients with multiple fungal infections had recent surgery (within the past 2 weeks). Furthermore, the median lymphocyte count was significantly lower in patients with a single fungal infection (0.64 × 10^9/L) than in patients with multiple fungal infections (0.95 × 10^9/L).
Note: a Described by median and quartile, and the statistic was the Z value; other items were described as numbers [n (%)] and the statistic was the χ2 value. b Fisher χ2 value.

Prediction of risk factors related to death using machine learning
Random forest, logistic regression and support vector machine models were used to predict death, and their performance was evaluated (Table 4). Receiver operating characteristic (ROC) curves were generated from our datasets (Fig. 4). Random forest, which is suited to both classification and regression, showed excellent performance. It was used to interpret different characteristics of patients and predict the risk factors for candidemia and bacteraemia. The results revealed that serum creatinine level, endotoxic shock, length of stay in ICU, age, leukocyte count, total parenteral nutrition, total bilirubin level, length of stay in the hospital, PCT level and lymphocyte count were the most important prognostic factors for concomitant candidemia and bacteraemia. Detailed information is provided in Table 5. With these 10 characteristics, the RF model showed satisfactory performance in our datasets, with an AUC value of 0.8505 (Fig. 4).
Table 4 Performance of the machine-learning algorithms.
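
To make the ROC analysis in this section concrete, the sketch below shows how ROC points and the area under the curve can be derived from predicted risk scores and observed outcomes. It illustrates the general technique only; the function names are hypothetical, the example values are not study data, and nothing here reproduces Fig. 4 or the reported AUC.

```python
def roc_points(labels, scores):
    """Return ROC points (false-positive rate, true-positive rate).

    The decision threshold is swept from the highest score to the lowest;
    cases with tied scores are added together so ties form a single point.
    labels: 1 = event (death), 0 = no event.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every case that shares this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    # Hypothetical outcomes and risk scores for four patients.
    outcomes = [0, 0, 1, 1]
    risk = [0.1, 0.4, 0.35, 0.8]
    pts = roc_points(outcomes, risk)
    print(pts)
    print(trapezoid_auc(pts))
```

An AUC of 0.5 corresponds to a model that ranks patients no better than chance, and 1.0 to a model that ranks every patient who died above every survivor.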

DISCUSSION
To the best of our knowledge, research on concomitant candidemia and bacteraemia is limited. We collected the detailed clinical information of 367 patients with candidemia and bacteraemia from January 2013 to January 2018 in a provincial medical centre in northeast China. Among the selected patients, 169 (46.0%) stayed in the ICU during hospitalisation, 358 (97.5%) stayed in the hospital for >10 days and 224 (61.0%) had multiple hospitalisations within the past 2 years. Many patients had conditions such as hypoproteinaemia (249/367, 67.8%) and solid tumours (182/367, 49.6%). These conditions may be associated with weak immunity and long-term hospitalisation; moreover, Candida colonisation may have worsened the condition of these patients (Nami et al., 2019). Some recent studies have reported similar features (Keighley et al., 2019;Koehler et al., 2019). In addition, urinary catheter insertion (285/367, 77.7%), gastric tube insertion (213/367, 58.0%), central venous catheter insertion (229/367, 62.4%), drainage catheter insertion (245/367, 66.8%) and total parenteral nutrition (298/367, 81.2%) were identified as high-risk factors for candidemia and bacteraemia. As shown in previous studies, invasive medical support may be associated with the deteriorating state of patients with candidemia and bacteraemia (Ang et al., 1993;Fisher et al., 2011;Janum & Afshari, 2016;Ala-Houhala & Anttila, 2020).
Based on the results of machine learning in this study, we found that the most important predictors of death in patients with concomitant candidemia and bacteraemia included serum creatinine level, endotoxic shock, length of stay in ICU, age, leukocyte count, total parenteral nutrition, total bilirubin level, length of stay in the hospital, PCT level and lymphocyte count. Studies have reported that serum creatinine levels were increased in Candida infection, which may lead to renal dysfunction and increase infection-related mortality (Wong et al., 1982;Bellomo et al., 2017;Ronco, Bellomo & Kellum, 2019;Arase et al., 2020). Endotoxic shock was associated with Candida infection and was one of the main causes of morbidity and mortality worldwide; however, different features might be exhibited according to different species and the immune status of patients (Duggan et al., 2015;Eggimann et al., 2015;Poissy et al., 2020). Prolonged ICU stays could lead to a higher risk of complications and death based on the duration of intensive care (Ruping, Vehreschild & Cornely, 2008;Vincent et al., 2009). Age was also an important factor; elderly patients had low immunity, chronic diseases and multi-organ failure, making them susceptible to invasive Candida infections such as candidemia (Pluim et al., 2012;Mundula et al., 2019). The leukocyte count was associated with mortality in patients with Candida infections including candidemia (Riley & Rupert, 2015). In addition, total parenteral nutrition could increase the risk of complications and death (Ruiz-Ruigómez et al., 2018;Logan, Martin-Loeches & Bicanic, 2020;Poissy et al., 2020). The total bilirubin level was associated with mortality and factors such as age and primary diseases (Yang et al., 2017;Novák et al., 2020). The length of stay in the hospital could also serve as a prognostic indicator and could be influenced by conditions such as individual differences, primary diseases and different treatment strategies. 
Moreover, the longer the patients stayed in the hospital, the higher the hospitalisation cost (Lü et al., 2018;Zhang et al., 2020b). PCT is a biomarker for infections and can also serve as a promising prognostic indicator in patients with Candida and bacterial infections. Some studies have attempted to differentiate between candidemia and bacteraemia based on PCT levels; however, the differentiation remains challenging (Cortegiani et al., 2019;Honore et al., 2020). In addition, the lymphocyte count was a prognostic indicator of infection and was associated with mortality (Hatinguais, Willment & Brown, 2020;Zhang et al., 2020a).
However, this study has some limitations. First, the data were collected from a single-centre medical database; therefore, the results may be affected by geographical differences, hospital-specific management and regional policies. Second, the limited sample size and regional differences may influence the outcomes of machine learning. Therefore, multi-centre studies should be conducted to further explore the epidemiological features and prospective risk factors.

CONCLUSIONS
In this study, the most common Candida and bacterial species found in patients with concomitant candidemia and bacteraemia in the First Affiliated Hospital of China Medical University were Candida parapsilosis and Acinetobacter baumannii, respectively. Serum creatinine level, endotoxic shock, length of stay in ICU, age, leukocyte count, total parenteral nutrition, total bilirubin level, length of stay in the hospital, PCT level and lymphocyte count were identified as the main prognostic factors of death. As far as is known, the types of Candida infection and bacterial infection are highly regional. There are few studies on candidemia and bacteraemia in China, and studies in this field in Northeast China remain lacking. Our study helps to fill this gap.